.TH sdiag "1" "Slurm Commands" "July 2018" "Slurm Commands"

.SH "NAME"
.LP
sdiag \- Scheduling diagnostic tool for Slurm

.SH "SYNOPSIS"
.LP
sdiag

.SH "DESCRIPTION"
.LP
sdiag reports statistics about slurmctld execution: threads, agents, jobs,
and scheduling algorithms. The data it gathers can help when tuning
configuration parameters or queue policies. Its main purpose is to show how
Slurm behaves on systems with high throughput.
.LP
It has two execution modes. The default mode, \fB\-\-all\fR, reports the
counters and statistics described below. The other mode, \fB\-\-reset\fR,
resets those values.
.LP
By default, values are reset at midnight UTC.
.LP
The first block of information is related to global slurmctld execution:
.TP
\fBServer thread count\fR
The number of currently active slurmctld threads. A high number indicates a
heavy load from processing events such as job submissions, job dispatching,
and job completions. If this value is often close to MAX_SERVER_THREADS, it
could point to a potential bottleneck.

.TP
\fBAgent queue size\fR
Slurm is designed with scalability in mind, and sending messages to thousands
of nodes is not a trivial task. The agent mechanism controls communication
between the controller and the other Slurm daemons on a best-effort basis.
If this value is close to MAX_AGENT_CNT, there could be delays affecting job
management.

.TP
\fBAgent count\fR
Number of active agent threads.

.TP
\fBDBD Agent queue size\fR
Slurm queues up the messages intended for the SlurmDBD and processes them in a
separate thread. If the SlurmDBD, or database, is down then this number will
increase. The max queue size is calculated as:

MAX(10000, ((max_job_cnt * 2) + (node_record_count * 4)))

If this number grows beyond half of the max queue size, the slurmdbd and the
database should be investigated immediately.

.TP
\fBJobs submitted\fR
Number of jobs submitted since last reset.

.TP
\fBJobs started\fR
Number of jobs started since last reset. This includes backfilled jobs.

.TP
\fBJobs completed\fR
Number of jobs completed since last reset.

.TP
\fBJobs canceled\fR
Number of jobs canceled since last reset.

.TP
\fBJobs failed\fR
Number of jobs failed due to slurmd or other internal issues since last reset.

.TP
\fBJob states ts:\fR
Time stamp of when the following job state counts were gathered.

.TP
\fBJobs pending:\fR
Number of jobs pending at the time given by the time stamp above.

.TP
\fBJobs running:\fR
Number of jobs running at the time given by the time stamp above.

.TP
\fBJobs running ts:\fR
Time stamp of when the running job count was taken.
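.LP
As a rough illustration of the \fBDBD Agent queue size\fR limit described
above, the following Python sketch evaluates the formula and the "half of the
max queue size" rule. The function names and the sample values are
hypothetical; only the formula itself comes from this page.
.LP
.nf
```python
def dbd_max_queue_size(max_job_cnt: int, node_record_count: int) -> int:
    """Evaluate MAX(10000, (max_job_cnt * 2) + (node_record_count * 4))."""
    return max(10000, (max_job_cnt * 2) + (node_record_count * 4))


def dbd_queue_needs_attention(queue_size: int, max_queue_size: int) -> bool:
    """True once the queue has grown past half of the maximum size."""
    return queue_size * 2 > max_queue_size


# Hypothetical cluster: job limit of 50000 and 2000 node records.
limit = dbd_max_queue_size(max_job_cnt=50000, node_record_count=2000)
print(limit)                                    # 108000
print(dbd_queue_needs_attention(60000, limit))  # True
```
.fi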

.LP
The second block of information is related to the main scheduling algorithm,
which is based on job priorities. A scheduling cycle acquires the
job_write_lock lock and then tries to allocate resources to pending jobs,
starting with the highest priority job and proceeding in descending order of
priority. Once a job cannot get resources, the loop continues, but only for
jobs requesting other partitions. Jobs with dependencies or affected by
account limits are not processed.

.TP
\fBLast cycle\fR
Time in microseconds for the last scheduling cycle.

.TP
\fBMax cycle\fR
Time in microseconds for the maximum scheduling cycle since last reset.

.TP
\fBTotal cycles\fR
Number of scheduling cycles since last reset. Scheduling is performed
periodically and whenever a job is submitted or completes.

.TP
\fBMean cycle\fR
Mean duration of scheduling cycles since last reset.

.TP
\fBMean depth cycle\fR
Mean cycle depth. Depth is the number of jobs processed in a scheduling cycle.

.TP
\fBCycles per minute\fR
Number of scheduling cycles executed per minute.

.TP
\fBLast queue length\fR
Length of jobs pending queue.
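.LP
The way the main scheduler skips a partition once a job in it cannot get
resources can be sketched as follows. This is a simplified illustration and
not the slurmctld implementation: the function name, the tuple layout, and
the per partition free node counts are assumptions, and locking, dependencies,
and limits are ignored.
.LP
.nf
```python
def main_schedule_cycle(pending, free_nodes):
    """Return the names of the jobs started in one simplified cycle.

    pending:    list of (name, priority, partition, nodes) tuples
    free_nodes: dict mapping partition name to its free node count
    """
    free = dict(free_nodes)   # work on a copy, leave the input untouched
    blocked = set()           # partitions where a job could not start
    started = []
    ordered = sorted(pending, key=lambda job: job[1], reverse=True)
    for name, _priority, partition, nodes in ordered:
        if partition in blocked:
            continue          # a higher priority job is waiting here
        if nodes <= free[partition]:
            free[partition] -= nodes
            started.append(name)
        else:
            blocked.add(partition)
    return started


# Hypothetical queue: "a" does not fit, so "b" is skipped as well.
queue = [("a", 100, "batch", 4), ("b", 90, "batch", 1),
         ("c", 80, "debug", 1)]
print(main_schedule_cycle(queue, {"batch": 2, "debug": 2}))
```
.fi
.LP
In this example only job "c" starts: job "a" blocks the "batch" partition, so
the lower priority job "b" is not considered even though it would fit.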

.LP
The third block of information is related to the backfilling scheduling
algorithm. A backfilling scheduling cycle acquires locks on job, node, and
partition objects and then tries to allocate resources to pending jobs. Jobs
are processed in priority order. If a job cannot get resources, the algorithm
calculates when it could get them, yielding a future start time for the job.
The next job is then processed: the algorithm tries to allocate resources to
it without affecting the \fIprevious ones\fR, and again calculates a future
start time if resources are not currently available. Each additional job
takes longer to process, since more higher priority jobs must not be
affected. The algorithm itself takes measures to avoid long execution cycles
and to avoid holding all the locks for too long.

.TP
\fBTotal backfilled jobs (since last slurm start)\fR
Number of jobs started thanks to backfilling since last slurm start.

.TP
\fBTotal backfilled jobs (since last stats cycle start)\fR
Number of jobs started thanks to backfilling since the last time stats were
reset. By default these values are reset at midnight UTC.

.TP
\fBTotal backfilled heterogeneous job components\fR
Number of heterogeneous job components started thanks to backfilling since
last Slurm start.

.TP
\fBTotal cycles\fR
Number of backfill scheduling cycles since last reset.

.TP
\fBLast cycle when\fR
Time when the last execution cycle happened, in the format
"weekday Month MonthDay hour:minute.seconds year".

.TP
\fBLast cycle\fR
Time in microseconds of the last backfilling cycle.
Only execution time is counted. Time spent sleeping inside a scheduling cycle
that runs for too long is excluded.
Note that locks are released during the sleep time so that other work can
proceed.

.TP
\fBMax cycle\fR
Time in microseconds of the longest backfilling cycle since last reset.
Only execution time is counted. Time spent sleeping inside a scheduling cycle
that runs for too long is excluded.
Note that locks are released during the sleep time so that other work can
proceed.

.TP
\fBMean cycle\fR
Mean of backfilling scheduling cycles in microseconds since last reset.

.TP
\fBLast depth cycle\fR
Number of jobs processed during the last backfilling scheduling cycle. Every
job is counted, even if it has no chance to run due to dependencies or limits.

.TP
\fBLast depth cycle (try sched)\fR
Number of jobs processed during the last backfilling scheduling cycle. Only
jobs with a chance to run once resources become available are counted. These
are the jobs that make the backfilling algorithm more expensive.

.TP
\fBDepth Mean\fR
Mean of processed jobs during backfilling scheduling cycles since last reset.
Jobs which are found to be ineligible to run when examined by the backfill
scheduler are not counted (e.g. jobs submitted to multiple partitions and
already started, jobs which have reached a QOS or account limit such as
maximum running jobs for an account, etc).

.TP
\fBDepth Mean (try sched)\fR
The subset of Depth Mean that the backfill scheduler attempted to schedule.

.TP
\fBLast queue length\fR
Number of jobs pending to be processed by backfilling algorithm.
A job is counted once for each partition it requested.
A pending job array will normally be counted as one job (tasks of a job array
which have already been started/requeued or individually modified will already
have individual job records and are each counted as a separate job).

.TP
\fBQueue length Mean\fR
Mean of jobs pending to be processed by backfilling algorithm.
A job is counted once for each partition it requested.
A pending job array will normally be counted as one job (tasks of a job array
which have already been started/requeued or individually modified will already
have individual job records and are each counted as a separate job).
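.LP
The backfilling idea described at the start of this block, giving each job
the earliest start time that does not delay higher priority jobs, can be
sketched as follows. This is a deliberately simplified model and not the
Slurm backfill scheduler: it assumes a single pool of identical nodes and
known run times, and all function names and sample values are hypothetical.
.LP
.nf
```python
def used_at(reservations, t):
    """Nodes in use at time t; reservations are (start, end, nodes)."""
    return sum(n for s, e, n in reservations if s <= t < e)


def earliest_start(reservations, total_nodes, nodes, duration, now=0):
    """Earliest time at which nodes are free for duration time units."""
    # Capacity is only freed when a reservation ends, so only "now" and
    # those end times need to be tried as start times.
    ends = {e for _, e, _ in reservations if e > now}
    for start in sorted({now} | ends):
        end = start + duration
        # Usage can only rise where another reservation begins.
        checks = [start] + [s for s, _, _ in reservations if start < s < end]
        if all(used_at(reservations, t) + nodes <= total_nodes
               for t in checks):
            return start
    raise ValueError("job needs more nodes than the cluster has")


def backfill_plan(running, pending, total_nodes, now=0):
    """Plan start times for pending jobs, highest priority first."""
    reservations = list(running)
    plan = {}
    for name, nodes, duration in pending:
        start = earliest_start(reservations, total_nodes, nodes,
                               duration, now)
        # Reserve the slot so that lower priority jobs cannot delay it.
        reservations.append((start, start + duration, nodes))
        plan[name] = start
    return plan


# Hypothetical 4 node cluster with one job holding 3 nodes until t=10.
running = [(0, 10, 3)]
pending = [("A", 4, 5), ("B", 1, 10), ("C", 1, 20)]
print(backfill_plan(running, pending, total_nodes=4))
```
.fi
.LP
In this example job A must wait until t=10, job B is backfilled at t=0
because it finishes before the start reserved for A, and job C has to wait
until t=15. Each new job is checked against every earlier reservation, which
is why processing gets slower as more jobs are planned.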

.LP
The fourth and fifth blocks of information report the most frequently issued
remote procedure calls (RPCs), which are requests for the slurmctld daemon to
perform some action.
The fourth block reports the RPCs issued by message type.
The RPC codes can be looked up in the Slurm source code, in the file
src/common/slurm_protocol_defs.h.
The report includes the number of times each RPC is invoked, the total time
consumed by all of those RPCs plus the average time consumed by each RPC in
microseconds.
The fifth block reports the RPCs issued by user ID, the total number of RPCs
they have issued, the total time consumed by all of those RPCs plus the average
time consumed by each RPC in microseconds.
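.LP
The relation between the count, total time, and average time reported for
each RPC, and the two time based orderings offered by the sort options, can
be sketched as follows. The record layout, the function names, and the sample
numbers are hypothetical, and sorting with the largest value first is an
assumption.
.LP
.nf
```python
from typing import NamedTuple


class RpcStat(NamedTuple):
    name: str       # message type name
    count: int      # number of times the RPC was invoked
    total_us: int   # total time consumed, in microseconds

    @property
    def average_us(self) -> float:
        """Average time per RPC: total time divided by the count."""
        return self.total_us / self.count


def sort_by_total_time(stats):
    return sorted(stats, key=lambda s: s.total_us, reverse=True)


def sort_by_average_time(stats):
    return sorted(stats, key=lambda s: s.average_us, reverse=True)


# Made up numbers, for illustration only.
stats = [
    RpcStat("REQUEST_NODE_INFO", count=10, total_us=30000),
    RpcStat("REQUEST_JOB_INFO", count=200, total_us=40000),
]
print([s.name for s in sort_by_total_time(stats)])
print([s.name for s in sort_by_average_time(stats)])
```
.fi
.LP
With these numbers the frequently issued RPC leads when sorting by total
time, while the rarely issued but slow RPC leads when sorting by average
time.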

.LP
The sixth block of information, labeled Pending RPC Statistics, shows
information about pending outgoing RPCs on the slurmctld agent queue.
The first section of this block shows types of RPCs on the queue and the
count of each. The second section shows up to the first 25 individual RPCs
pending on the agent queue, including the type and the destination host list.
This information is cached and refreshed only every 30 seconds.

.SH "OPTIONS"
.LP

.TP
\fB\-a\fR, \fB\-\-all\fR
Get and report information. This is the default mode of operation.

.TP
\fB\-h\fR, \fB\-\-help\fR
Print description of options and exit.

.TP
\fB\-i\fR, \fB\-\-sort\-by\-id\fR
Sort Remote Procedure Call (RPC) data by message type ID and user ID.

.TP
\fB\-r\fR, \fB\-\-reset\fR
Reset counters. Only supported for Slurm operators and administrators.

.TP
\fB\-t\fR, \fB\-\-sort\-by\-time\fR
Sort Remote Procedure Call (RPC) data by total run time.

.TP
\fB\-T\fR, \fB\-\-sort\-by\-time2\fR
Sort Remote Procedure Call (RPC) data by average run time.

.TP
\fB\-\-usage\fR
Print list of options and exit.

.TP
\fB\-V\fR, \fB\-\-version\fR
Print current version number and exit.

.SH "ENVIRONMENT VARIABLES"
.PP
Some \fBsdiag\fR options may be set via environment variables. These
environment variables, along with their corresponding options, are listed below.
(Note: command line options will always override these settings.)
.TP 20
\fBSLURM_CONF\fR
The location of the Slurm configuration file.

.SH "COPYING"
Copyright (C) 2010\-2011 Barcelona Supercomputing Center.
.br
Copyright (C) 2010\-2017 SchedMD LLC.
.LP
Slurm is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
.LP
Slurm is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
details.

.SH "SEE ALSO"
.LP
sinfo(1), squeue(1), scontrol(1), slurm.conf(5)
